
    Statistical Methods in Topological Data Analysis for Complex, High-Dimensional Data

    The utilization of statistical methods and their applications within the new field of study known as Topological Data Analysis has tremendous potential for broadening our exploration and understanding of complex, high-dimensional data spaces. This paper provides an introductory overview of the mathematical underpinnings of Topological Data Analysis, the workflow to convert samples of data to topological summary statistics, and some of the statistical methods developed for performing inference on these topological summary statistics. The intention of this non-technical overview is to motivate statisticians who are interested in learning more about the subject. Comment: 15 pages, 7 figures, 27th Annual Conference on Applied Statistics in Agriculture.

    STATISTICAL THRESHOLD VALUES FOR LOCATING QUANTITATIVE TRAIT LOCI

    The detection and location of quantitative trait loci (QTL) that control quantitative characters is a problem of great interest to the genetic mapping community. Interval mapping has proved to be a useful tool in locating QTL, but has recently been challenged by faster, more sophisticated regression methods (e.g., composite interval mapping). Regardless of the method used to locate QTL, the distribution of the test statistic (LOD score or likelihood ratio test) is unknown. Because the quantitative trait values follow a mixture distribution rather than a single distribution, the asymptotic distribution of the test statistic is not from a standard family, such as chi-square. The purpose of this work is to introduce interval mapping, discuss the distribution of the resulting test statistic, and then present empirical threshold values for the declaration of major QTL, as well as minor QTL. Empirical threshold values are obtained by permuting the actual experimental trait data, under a fixed and known genetic map, for the purpose of representing the distribution of the test statistic under the null hypothesis of no QTL effect. Not only is a permutation test statistically justified in this case, the test reflects the specifics of the experimental situation under investigation (i.e., sample size, marker density, skewing, etc.), and may be used in a conditional sense to derive thresholds for minor QTL once a major effect has been determined.
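The permutation scheme described in this abstract can be sketched in a few lines: shuffle the trait values while holding the marker genotypes fixed, record the genome-wide maximum test statistic for each shuffle, and take an upper quantile of those maxima as the threshold. The following is a minimal illustration, not the authors' implementation; the simple one-way likelihood-ratio statistic used here is a stand-in for the LOD score produced by interval mapping, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def lrt_statistic(genotypes, traits):
    """Likelihood-ratio-style statistic at one marker: compare a single-mean
    model against a model with a separate mean per genotype class.
    (A simplified stand-in for the LOD score from interval mapping.)"""
    rss_null = np.sum((traits - traits.mean()) ** 2)
    rss_alt = 0.0
    for g in np.unique(genotypes):
        grp = traits[genotypes == g]
        rss_alt += np.sum((grp - grp.mean()) ** 2)
    return len(traits) * np.log(rss_null / rss_alt)

def empirical_threshold(marker_matrix, traits, n_perm=200, alpha=0.05):
    """Permute trait values against the fixed genetic map; the (1 - alpha)
    quantile of the per-permutation genome-wide maxima is the threshold."""
    maxima = np.empty(n_perm)
    for i in range(n_perm):
        shuffled = rng.permutation(traits)  # breaks any marker-trait link
        maxima[i] = max(lrt_statistic(m, shuffled) for m in marker_matrix)
    return np.quantile(maxima, 1 - alpha)
```

Because each permutation destroys any association between markers and trait while preserving the sample size, marker density, and trait distribution, the resulting threshold automatically reflects the specifics of the experiment, as the abstract notes.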

    A NON-PARAMETRIC EMPIRICAL BAYES APPROACH FOR ESTIMATING TRANSCRIPT ABUNDANCE IN UN-REPLICATED NEXT-GENERATION SEQUENCING DATA

    Empirical Bayes approaches have been widely used to analyze data from high-throughput sequencing devices. These approaches rely on borrowing information available for all the genes across samples to get better estimates of gene-level expression. To date, transcript abundance in data from next-generation sequencing (NGS) technologies has been estimated using parametric approaches for analyzing count data, namely the gamma-Poisson, negative binomial, and over-dispersed logistic models. One serious limitation of these approaches is that they cannot be applied in the absence of replication. The high cost of NGS technologies imposes a serious restriction on the number of biological replicates that can be assessed. In this work, a simple non-parametric empirical Bayes modeling approach is suggested for the estimation of transcript abundances in un-replicated NGS data. The empirical Bayes analysis of NGS data follows naturally from the empirical Bayes analysis of microarray data by modifying the distributional assumption on the observations. The analysis is presented for transcript abundance estimation for two treatment groups in an un-replicated experiment, but it is easily extended for more treatment groups and replicated experiments.

    CORRECTING FOR AMPLIFICATION BIAS IN NEXT-GENERATION SEQUENCING DATA

    Next-generation sequencing (NGS) technologies have opened the door to a wealth of knowledge and information about biological systems, particularly in genomics and epigenomics. These tools, although useful, carry with them additional technological and statistical challenges that need to be understood and addressed. One such issue is amplification bias. Specifically, the majority of NGS technologies effectively sample small amounts of DNA or RNA that are amplified (i.e., copied) prior to sequencing. The amplification process is not perfect, and thus sequenced read counts can be extremely biased. Unfortunately, current amplification bias controlling procedures introduce a dependence of gene expression on gene length, which effectively masks the effects of short genes with high transcription rates. In this work we present a novel procedure to account for amplification bias and demonstrate its effectiveness in estimating true gene expression independent of gene length.

    THE NUANCES OF STATISTICALLY ANALYZING NEXT-GENERATION SEQUENCING DATA

    High-throughput sequencing technologies, in particular next-generation sequencing (NGS) technologies, have emerged as the preferred approach for exploring both gene function and pathway organization. Data from NGS technologies pose new computational and statistical challenges because of their massive size, limited replicate information, large number of genes (high dimensionality), and discrete form. They are more complex than data from previous high-throughput technologies such as microarrays. In this work we focus on the statistical issues in analyzing and modeling NGS data for selecting genes suitable for further exploration and present a brief review of the relevant statistical methods. We discuss visualization methods to assess the suitability of statistical models for these data, statistical methods for modeling differential gene expression, and methods for checking goodness of fit of the models for NGS data. We also outline areas for further research, especially in the computational, statistical, and visualization aspects of such data.

    ISSUES IN TESTING DNA METHYLATION USING NEXT-GENERATION SEQUENCING

    DNA methylation is an epigenetic modification known to affect gene expression and cellular differentiation, as well as phenotypes. Recent advancements in next-generation sequencing technologies have provided unparalleled insight into the location and function of DNA methylation in a variety of organisms. These data require vastly different statistical procedures than data from previous genomic-based technologies. We outline the biological and chemical processes involved in several approaches for gaining DNA methylation data. The implications of the differences between the approaches are discussed relative to the statistical methodology, and the use of genome annotation is explored for the purpose of improving the statistical power when testing for differential methylation.

    A HIERARCHICAL BAYESIAN APPROACH FOR DETECTING DIFFERENTIAL GENE EXPRESSION IN UNREPLICATED RNA-SEQUENCING DATA

    Next-generation sequencing has emerged as a promising technology in a variety of fields, including genomics, epigenomics, and transcriptomics. These technologies play an important role in understanding cell organization and functionality. Unlike data from earlier technologies (e.g., microarrays), data from next-generation sequencing technologies are highly replicable with little technical variation. One application of next-generation sequencing technologies is RNA-Sequencing (RNA-Seq). It is used for detecting differential gene expression between different biological conditions. While statistical methods for detecting differential expression in RNA-Seq data exist, one serious limitation of these methods is the absence of biological replication. At present, the high cost of next-generation sequencing technologies imposes a serious restriction on the number of biological replicates. We present a simple parametric hierarchical Bayesian model for detecting differential expression in data from unreplicated RNA-Seq experiments. The model extends naturally to multiple treatment groups and any number of biological replicates. We illustrate the application of this model through simulation studies and compare our approach to existing methods for detecting differential expression, such as Fisher's exact test.
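The baseline method this abstract compares against, Fisher's exact test, can be applied to unreplicated two-library RNA-Seq counts by building, for each gene, a 2x2 table of that gene's reads versus all other reads in each library. The sketch below is an illustration of that comparison method only, not the paper's hierarchical Bayesian model; the function names and the two-sided convention (summing tables whose probability is at most the observed table's) are assumptions.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table
    [[a, b], [c, d]], computed from the hypergeometric distribution."""
    n = a + b + c + d
    row1, col1 = a + b, a + c
    denom = comb(n, row1)

    def prob(x):  # P(X = x) under the hypergeometric null
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    p_obs = prob(a)
    lo = max(0, row1 - (n - col1))  # smallest feasible count in cell a
    hi = min(row1, col1)            # largest feasible count in cell a
    # Sum the probabilities of all tables at least as extreme as observed.
    total = 0.0
    for x in range(lo, hi + 1):
        p = prob(x)
        if p <= p_obs * (1 + 1e-9):  # tolerance for float ties
            total += p
    return total

def de_pvalue(count_a, count_b, lib_a, lib_b):
    """p-value that a gene's read count differs between two unreplicated
    libraries, given each library's total read count (gene vs. rest)."""
    return fisher_exact_two_sided(count_a, lib_a - count_a,
                                  count_b, lib_b - count_b)
```

Stated per-gene, the test conditions on both library totals, which is why it needs no replicate-based variance estimate; the paper's point is that a hierarchical model can borrow strength across genes instead.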

    DYNAMIC CLUSTERING OF CELL-CYCLE MICROARRAY DATA

    The cell cycle is a crucial series of events that are repeated over time, allowing the cell to grow, duplicate, and split. Cell-cycle systems play an important role in cancer and other biological processes. Using gene expression data gained from microarray technology, it is possible to group or cluster genes that are involved in the cell cycle for the purpose of exploring their functional co-regulation. Typically, the goal of clustering methods as applied to gene expression data is to place genes with similar expression patterns or profiles into the same group or cluster for the purpose of inferring the function of unknown genes that cluster with genes of known function. Since a gene may be involved in more than one biological process at any one time, co-regulated genes may not have visually similar expression patterns. Furthermore, the time duration for genes in a biological process may differ, and the number of co-regulated patterns or biological processes shared by two genes may be unknown. Based on this reasoning, biologically realistic gene clusters gained from gene co-regulation may not be accurately identified using traditional clustering methods. By taking advantage of techniques and theories from signal processing, it is possible to cluster cell-cycle gene expression profiles using a dynamic perspective under the assumption that different spectral frequencies characterize different biological processes.
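The idea that different spectral frequencies characterize different biological processes can be made concrete with a toy version of the approach: take each gene's time-course profile, compute its periodogram, and group genes that share a dominant frequency, so that phase-shifted (visually dissimilar) profiles of the same periodic process still land in the same cluster. This is a minimal sketch under that assumption, not the paper's method; the function names are hypothetical.

```python
import numpy as np

def dominant_frequency(profile):
    """Index of the strongest nonzero frequency in the profile's periodogram."""
    centered = profile - np.mean(profile)        # drop the DC (mean) level
    spectrum = np.abs(np.fft.rfft(centered)) ** 2
    return int(np.argmax(spectrum[1:])) + 1      # skip the DC bin

def cluster_by_frequency(profiles):
    """Group expression profiles that share a dominant spectral frequency.

    profiles: dict mapping gene name -> 1-D array of expression over time.
    Returns a dict mapping frequency bin -> list of gene names.
    """
    clusters = {}
    for name, profile in profiles.items():
        freq = dominant_frequency(np.asarray(profile, dtype=float))
        clusters.setdefault(freq, []).append(name)
    return clusters
```

Two sinusoidal profiles at the same frequency but different phases share a periodogram peak, so they cluster together even though their expression patterns are not visually aligned, which is exactly the situation the abstract argues defeats traditional clustering.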